# Import the libraries required for the project
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
1. Refer to the above table and find the joint probability of the people who planned to purchase and actually placed an order.
2. Refer to the above table and find the probability that people actually placed an order, given that they planned to purchase (a conditional probability).
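# Sol 1: joint probability P(planned and ordered)
A = 400/2000
# Sol 2: conditional probability P(ordered | planned) = P(planned and ordered) / P(planned)
B = (400/2000)/(500/2000)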
print('Answer for 1st question',A)
print('Answer for 2nd question',B)
Answer for 1st question 0.2
Answer for 2nd question 0.8
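The two answers can also be derived from named counts, which makes the difference between a joint and a conditional probability explicit. This is a minimal sketch: the counts 2000, 500 and 400 are taken from the solution above (the table itself is not reproduced here), and the variable names are illustrative.

```python
# Counts as used in the solution above; names are illustrative.
total = 2000                 # everyone surveyed
planned = 500                # people who planned to purchase
planned_and_ordered = 400    # planned to purchase AND actually placed an order

# Joint probability: both events, relative to everyone surveyed
p_joint = planned_and_ordered / total

# Conditional probability: restrict attention to those who planned to purchase
p_planned = planned / total
p_ordered_given_planned = p_joint / p_planned

print(f"P(planned and ordered) = {p_joint:.2f}")
print(f"P(ordered | planned)   = {p_ordered_given_planned:.2f}")
```

Dividing by P(planned) is equivalent to computing 400/500 directly, because the total cancels.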
An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.
A. Probability that none of the items are defective?
B. Probability that exactly one of the items is defective?
C. Probability that two or fewer of the items are defective?
D. Probability that three or more of the items are defective?
## Question 2 follows a binomial distribution
## Here a "success" is a defective item, so p = 0.05
## Probability that an item is not defective is 0.95
## Sample size n = 10
k1=0 # zero defects
k2=1 # exactly one defect
k3=[0,1,2] # two or fewer defects
k4=np.arange(3,11) # three or more defects
n=10
p=0.05
binomial1 = stats.binom.pmf(k1,n,p)
binomial2 = stats.binom.pmf(k2,n,p)
binomial3 = stats.binom.pmf(k3,n,p)
binomial4 = stats.binom.pmf(k4,n,p)
print('Probability that none of the items are defective',round(binomial1,4))
print('Probability that exactly one of the items are defective',round(binomial2,4))
print('Probability that two or fewer of the items are defective',round(sum(binomial3),4))
print('Probability that three or more of the items are defective',round(sum(binomial4),4))
Probability that none of the items are defective 0.5987
Probability that exactly one of the items are defective 0.3151
Probability that two or fewer of the items are defective 0.9885
Probability that three or more of the items are defective 0.0115
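The cumulative answers (C and D) can also be obtained without summing individual PMF values. The sketch below uses `stats.binom.cdf` and `stats.binom.sf` from SciPy; the variable names are illustrative and not part of the original solution.

```python
from scipy import stats

n, p = 10, 0.05   # sample size and probability that an item is defective

p_none = stats.binom.pmf(0, n, p)           # P(X = 0)
p_one = stats.binom.pmf(1, n, p)            # P(X = 1)
p_two_or_fewer = stats.binom.cdf(2, n, p)   # P(X <= 2)
p_three_or_more = stats.binom.sf(2, n, p)   # survival function: P(X > 2) = P(X >= 3)

print(round(p_none, 4), round(p_one, 4),
      round(p_two_or_fewer, 4), round(p_three_or_more, 4))
```

`sf(k)` is defined as `1 - cdf(k)`, so C and D are complements by construction.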
A car salesman sells on an average 3 cars per week.
A. Probability that in a given week he will sell some cars.
B. Probability that in a given week he will sell 2 or more but less than 5 cars.
C. Plot the Poisson distribution function for cumulative probability of cars sold per week vs number of cars sold per week.
# Question 3 follows a Poisson distribution
rate = 3
n0=0 # he sells no cars in a week
poisson_0 = stats.poisson.pmf(n0,rate)
n1=1-poisson_0 # he sells some cars in a week, i.e. 1 or more
n2=[2,3,4] # he sells 2 or more but fewer than 5 cars
poisson1 = n1
poisson2 = stats.poisson.pmf(n2,rate)
print('A) Probability that in a given week he will sell some cars is', round(poisson1,4))
print('B) Probability that in a given week he will sell 2 or more but less than 5 cars is',round(sum(poisson2),4))
A) Probability that in a given week he will sell some cars is 0.9502
B) Probability that in a given week he will sell 2 or more but less than 5 cars is 0.6161
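As a cross-check, both answers can be written in terms of the cumulative distribution rather than individual PMF values. This is a small sketch with illustrative variable names; it assumes the same rate of 3 cars per week.

```python
from scipy import stats

rate = 3  # average number of cars sold per week

# A) P(X >= 1) = P(X > 0), which is the survival function evaluated at 0
p_some = stats.poisson.sf(0, rate)

# B) P(2 <= X <= 4) = P(X <= 4) - P(X <= 1)
p_two_to_four = stats.poisson.cdf(4, rate) - stats.poisson.cdf(1, rate)

print(round(p_some, 4), round(p_two_to_four, 4))
```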
n3=np.arange(0,18)
poisson_cdf = stats.poisson.cdf(n3,rate)
poisson_pmf = stats.poisson.pmf(n3,rate)
plt.plot(n3,poisson_cdf, 'g*-') # cumulative probability against number of cars sold
plt.title('C) Cumulative probability of cars sold per week vs number of cars sold per week')
plt.xlabel('Number of cars sold per week')
plt.ylabel('Cumulative probability')
plt.show()
Accuracy in understanding orders is important for a speech-based bot at a restaurant, which Company X designed, marketed and launched for contactless delivery during the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.
A. What is the probability that all three orders will be recognised correctly?
B. What is the probability that none of the three orders will be recognised correctly?
C. What is the probability that at least two of the three orders will be recognised correctly?
# Question 4 follows a binomial distribution
p1=0.868
n1=3
k5=np.arange(0,4)
binomial5 = stats.binom.pmf(k5,n1,p1)
print('A) All the three orders will be recognised correctly is %1.4f' %binomial5[3])
print('B) None orders will be recognised correctly is %1.4f' %binomial5[0])
print('C) At least two of the three orders will be recognised correctly is %1.4f' %(1-(binomial5[0]+binomial5[1])))
A) All the three orders will be recognised correctly is 0.6540
B) None orders will be recognised correctly is 0.0023
C) At least two of the three orders will be recognised correctly is 0.9523
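Because the three orders are independent, the same answers follow from the multiplication rule alone, which is a useful sanity check on the binomial PMF. The sketch below is illustrative; the variable names are assumptions and only the accuracy of 0.868 comes from the question.

```python
from scipy import stats

p = 0.868   # probability that a single order is recognised correctly
q = 1 - p   # probability that a single order is not recognised correctly

p_all_three = p ** 3                  # correct AND correct AND correct
p_none = q ** 3                       # wrong AND wrong AND wrong
p_exactly_two = 3 * p ** 2 * q        # 3 ways to choose which one order is wrong
p_at_least_two = p_exactly_two + p_all_three

# Cross-check against SciPy: P(X >= 2) = P(X > 1)
p_at_least_two_scipy = stats.binom.sf(1, 3, p)

print(round(p_all_three, 4), round(p_none, 4), round(p_at_least_two, 4))
```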
A group of 300 professionals sat for a competitive exam. The results show that the marks obtained have a mean of 60 and a standard deviation of 12. The marks follow a normal distribution. Answer the following questions.
A. What is the percentage of students who score more than 80?
B. What is the percentage of students who score less than 50?
C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
# Normal distribution
mean = 60
std = 12
Norm=stats.norm.cdf(80,loc=mean,scale=std)
M= 1- Norm # P(X > 80)
N=stats.norm.cdf(50,loc=mean,scale=std) # P(X < 50)
# P(Z >= (X-60)/12) = 0.10: the probability is known and we need to find the mark X
# From the Z-score table, the z value for a cumulative probability of 0.9 is 1.282
# (X-60)/12 = 1.282
O=(1.282*12)+60
print(' A) Percentage of students who score more than 80 is %1.2f%%' % (M*100))
print(' B) Percentage of students who score less than 50 is %1.2f%%' % (N*100))
print(' C) Distinction mark if the highest 10% of students are to be awarded distinction is',O)
A) Percentage of students who score more than 80 is 4.78%
B) Percentage of students who score less than 50 is 20.23%
C) Distinction mark if the highest 10% of students are to be awarded distinction is 75.384
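The distinction mark can also be computed without a Z-table lookup by using the inverse CDF (percent point function). The sketch below uses `stats.norm.ppf` and `stats.norm.sf`; the variable names are illustrative. The result differs slightly from 75.384 because the table value 1.282 is a rounded z-score.

```python
from scipy import stats

mean, std = 60, 12

# Percentages: multiply the probabilities by 100
pct_above_80 = 100 * stats.norm.sf(80, loc=mean, scale=std)    # P(X > 80)
pct_below_50 = 100 * stats.norm.cdf(50, loc=mean, scale=std)   # P(X < 50)

# Mark below which 90% of students fall, so the top 10% lie above it
distinction_mark = stats.norm.ppf(0.90, loc=mean, scale=std)

print(f"More than 80: {pct_above_80:.2f}%")
print(f"Less than 50: {pct_below_50:.2f}%")
print(f"Distinction mark: {distinction_mark:.2f}")
```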
Explain 1 real-life industry scenario [other than the ones mentioned above] where you can use the concepts learnt in this module of Applied Statistics to get a data-driven business solution.
The following examples illustrate the role of statistics in real life.
1) Medical Study
Statistics underpin medical studies. They help doctors track where a baby should be in his or her mental development, and physicians also use statistics to examine the effectiveness of treatments.
2) Weather Forecasts
Statistics are important for observation, analysis and mathematical prediction models. Weather forecast models are built using statistics that compare prior weather conditions with current weather to forecast future conditions.
3) Quality Testing
A company makes thousands of products every day and must make sure that it sells only good-quality items. Because it is not possible to test every product, the company tests a sample and uses statistics to draw conclusions about the whole batch.
4) Stock Market
The stock market also uses statistical computer models for stock analysis. Stock analysts use statistical concepts to get information about the economy.
5) Consumer Goods
Retailers use statistics to keep track of everything they sell and of the stock they hold. Leading retailers worldwide use statistics to calculate which products to ship to each store and when.
# The dataset marks missing values with '-', so treat '-' as NaN
missing_values=['-']
BBdata= pd.read_csv('C:\\Users\\Prem\\Favorites\\DS - Part2 - Basketball (1).csv',na_values=missing_values)
BBdata.head(10)
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385.0 | 2762.0 | 1647.0 | 552.0 | 563.0 | 5947.0 | 3140.0 | 33.0 | 23.0 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262.0 | 2762.0 | 1581.0 | 573.0 | 608.0 | 5900.0 | 3114.0 | 25.0 | 25.0 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442.0 | 2614.0 | 1241.0 | 598.0 | 775.0 | 4534.0 | 3309.0 | 10.0 | 8.0 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386.0 | 2664.0 | 1187.0 | 616.0 | 861.0 | 4398.0 | 3469.0 | 6.0 | 6.0 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368.0 | 2762.0 | 1209.0 | 633.0 | 920.0 | 4631.0 | 3700.0 | 8.0 | 7.0 | 1929 | 1 |
| 5 | Team 6 | 73 | 2819.0 | 2408.0 | 990.0 | 531.0 | 887.0 | 3680.0 | 3373.0 | 1.0 | 4.0 | 1934-35 | 1 |
| 6 | Team 7 | 82 | 2792.0 | 2626.0 | 948.0 | 608.0 | 1070.0 | 3609.0 | 3889.0 | NaN | NaN | 1929 | 3 |
| 7 | Team 8 | 70 | 2573.0 | 2302.0 | 864.0 | 577.0 | 861.0 | 3228.0 | 3230.0 | 2.0 | 3.0 | 1929 | 1 |
| 8 | Team 9 | 58 | 2109.0 | 1986.0 | 698.0 | 522.0 | 766.0 | 2683.0 | 2847.0 | NaN | 1.0 | 1939-40 | 2 |
| 9 | Team 10 | 51 | 1884.0 | 1728.0 | 606.0 | 440.0 | 682.0 | 2159.0 | 2492.0 | 1.0 | NaN | 1932-33 | 1 |
BBdata.describe(include='all')
| | Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 61 | 61.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 60.000000 | 9.000000 | 13.000000 | 61 | 61.000000 |
| unique | 61 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 47 | NaN |
| top | Team 18 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1929 | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10 | NaN |
| mean | NaN | 24.000000 | 916.450000 | 810.100000 | 309.033333 | 192.083333 | 308.816667 | 1159.350000 | 1159.233333 | 9.666667 | 6.615385 | NaN | 7.081967 |
| std | NaN | 26.827225 | 1138.342899 | 877.465393 | 408.481395 | 201.985508 | 294.508639 | 1512.063948 | 1163.946914 | 11.618950 | 8.109033 | NaN | 5.276663 |
| min | NaN | 1.000000 | 14.000000 | 30.000000 | 5.000000 | 4.000000 | 15.000000 | 34.000000 | 55.000000 | 1.000000 | 1.000000 | NaN | 1.000000 |
| 25% | NaN | 4.000000 | 104.250000 | 115.500000 | 34.750000 | 26.250000 | 62.750000 | 154.500000 | 236.000000 | 1.000000 | 1.000000 | NaN | 3.000000 |
| 50% | NaN | 12.000000 | 395.500000 | 424.500000 | 124.000000 | 98.500000 | 197.500000 | 444.000000 | 632.500000 | 6.000000 | 4.000000 | NaN | 6.000000 |
| 75% | NaN | 38.000000 | 1360.500000 | 1345.500000 | 432.750000 | 331.500000 | 563.500000 | 1669.750000 | 2001.250000 | 10.000000 | 7.000000 | NaN | 10.000000 |
| max | NaN | 86.000000 | 4385.000000 | 2762.000000 | 1647.000000 | 633.000000 | 1070.000000 | 5947.000000 | 3889.000000 | 33.000000 | 25.000000 | NaN | 20.000000 |
BBdata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Team                 61 non-null     object
 1   Tournament           61 non-null     int64
 2   Score                60 non-null     float64
 3   PlayedGames          60 non-null     float64
 4   WonGames             60 non-null     float64
 5   DrawnGames           60 non-null     float64
 6   LostGames            60 non-null     float64
 7   BasketScored         60 non-null     float64
 8   BasketGiven          60 non-null     float64
 9   TournamentChampion   9 non-null      float64
 10  Runner-up            13 non-null     float64
 11  TeamLaunch           61 non-null     object
 12  HighestPositionHeld  61 non-null     int64
dtypes: float64(9), int64(2), object(2)
memory usage: 6.3+ KB
BBdata.isnull().sum()
Team                    0
Tournament              0
Score                   1
PlayedGames             1
WonGames                1
DrawnGames              1
LostGames               1
BasketScored            1
BasketGiven             1
TournamentChampion     52
Runner-up              48
TeamLaunch              0
HighestPositionHeld     0
dtype: int64
# Function to calculate missing values by column
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    # Sort the table by percentage of missing descending
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    # Print some summary information
    print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
           "There are " + str(mis_val_table_ren_columns.shape[0]) +
           " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_val_table_ren_columns
missing_values_table(BBdata)
Your selected dataframe has 13 columns.
There are 9 columns that have missing values.
| | Missing Values | % of Total Values |
|---|---|---|
| TournamentChampion | 52 | 85.2 |
| Runner-up | 48 | 78.7 |
| Score | 1 | 1.6 |
| PlayedGames | 1 | 1.6 |
| WonGames | 1 | 1.6 |
| DrawnGames | 1 | 1.6 |
| LostGames | 1 | 1.6 |
| BasketScored | 1 | 1.6 |
| BasketGiven | 1 | 1.6 |
TournamentChampion has 85.2% missing values and Runner-up has 78.7%, so both columns are dropped from the dataset. The remaining columns with missing values have only a small percentage missing (1.6% each), so these can be filled with the column mean.
# Copy the DataFrame so the original stays unchanged
BBdata1 = BBdata.copy()
# Drop the two columns that are mostly missing
BBdata1 = BBdata1.drop(columns=['TournamentChampion','Runner-up'])
# Fill the remaining missing values with the column mean
num_cols = ['Score','PlayedGames','WonGames','DrawnGames',
            'LostGames','BasketScored','BasketGiven']
BBdata1[num_cols] = BBdata1[num_cols].fillna(BBdata1[num_cols].mean())
# Change the float columns to int
BBdata1 = BBdata1.astype({col:'int' for col in num_cols})
# Drop the last row (index 60); the final dataset has no null values
BBdata1 = BBdata1.drop(BBdata1.index[60])
UD = BBdata1['PlayedGames']
len(UD)
60
UD.isnull().sum()
0
plt.hist(UD, bins=40)
(array([12., 7., 4., 2., 2., 4., 3., 0., 2., 3., 1., 0., 0.,
0., 1., 1., 0., 2., 1., 0., 2., 2., 0., 0., 2., 0.,
0., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 2.,
3.]),
array([ 30. , 98.3, 166.6, 234.9, 303.2, 371.5, 439.8, 508.1,
576.4, 644.7, 713. , 781.3, 849.6, 917.9, 986.2, 1054.5,
1122.8, 1191.1, 1259.4, 1327.7, 1396. , 1464.3, 1532.6, 1600.9,
1669.2, 1737.5, 1805.8, 1874.1, 1942.4, 2010.7, 2079. , 2147.3,
2215.6, 2283.9, 2352.2, 2420.5, 2488.8, 2557.1, 2625.4, 2693.7,
2762. ]),
<BarContainer object of 40 artists>)
sns.histplot(UD, kde=True, stat='density'); # replaces the deprecated distplot
plt.gcf().set_size_inches(10,3)
sns.boxplot(x=BBdata1['PlayedGames'],color='g');
sns.violinplot(x=UD)
<AxesSubplot:xlabel='PlayedGames'>
plt.figure(figsize=(20,10)) # makes the plot wider
plt.hist(UD, color='g') # plots a simple histogram
plt.axvline(UD.mean(), color='m', linewidth=3)
plt.axvline(UD.median(), color='b', linestyle='dashed', linewidth=3)
plt.axvline(UD.mode()[0], color='r', linestyle='dashed', linewidth=3)
<matplotlib.lines.Line2D at 0x263610213d0>
sns.pairplot(BBdata1, kind='reg')
<seaborn.axisgrid.PairGrid at 0x26364578460>
BBdata1.select_dtypes(include='number').cov() # covariance of the numeric columns only
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|
| Tournament | 722.782768 | 3.003033e+04 | 2.355925e+04 | 10671.614124 | 5372.170904 | 7516.715537 | 3.962956e+04 | 3.090991e+04 | -101.172034 |
| Score | 30030.333051 | 1.295825e+06 | 9.785666e+05 | 463704.696610 | 220292.385593 | 294612.626271 | 1.715454e+06 | 1.250509e+06 | -4049.124576 |
| PlayedGames | 23559.249153 | 9.785666e+05 | 7.699455e+05 | 346774.216949 | 176166.940678 | 247031.154237 | 1.286941e+06 | 1.011131e+06 | -3316.361017 |
| WonGames | 10671.614124 | 4.637047e+05 | 3.467742e+05 | 166857.049718 | 77508.658192 | 102428.853672 | 6.172252e+05 | 4.401460e+05 | -1406.527119 |
| DrawnGames | 5372.170904 | 2.202924e+05 | 1.761669e+05 | 77508.658192 | 40798.145480 | 57867.608757 | 2.877725e+05 | 2.333322e+05 | -773.258475 |
| LostGames | 7516.715537 | 2.946126e+05 | 2.470312e+05 | 102428.853672 | 57867.608757 | 86735.338701 | 3.820110e+05 | 3.376645e+05 | -1136.736441 |
| BasketScored | 39629.558475 | 1.715454e+06 | 1.286941e+06 | 617225.191525 | 287772.461864 | 382010.997458 | 2.286337e+06 | 1.638664e+06 | -5213.356780 |
| BasketGiven | 30909.909040 | 1.250509e+06 | 1.011131e+06 | 440145.958192 | 233332.166667 | 337664.535028 | 1.638664e+06 | 1.354772e+06 | -4499.689831 |
| HighestPositionHeld | -101.172034 | -4.049125e+03 | -3.316361e+03 | -1406.527119 | -773.258475 | -1136.736441 | -5.213357e+03 | -4.499690e+03 | 28.251695 |
BBdata1.select_dtypes(include='number').corr() # correlation of the numeric columns only
| | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | HighestPositionHeld |
|---|---|---|---|---|---|---|---|---|---|
| Tournament | 1.000000 | 0.981258 | 0.998683 | 0.971749 | 0.989295 | 0.949350 | 0.974867 | 0.987781 | -0.708002 |
| Score | 0.981258 | 1.000000 | 0.979687 | 0.997232 | 0.958090 | 0.878780 | 0.996634 | 0.943801 | -0.669215 |
| PlayedGames | 0.998683 | 0.979687 | 1.000000 | 0.967486 | 0.993972 | 0.955925 | 0.969970 | 0.990020 | -0.711065 |
| WonGames | 0.971749 | 0.997232 | 0.967486 | 1.000000 | 0.939416 | 0.851436 | 0.999312 | 0.925745 | -0.647819 |
| DrawnGames | 0.989295 | 0.958090 | 0.993972 | 0.939416 | 1.000000 | 0.972786 | 0.942234 | 0.992479 | -0.720248 |
| LostGames | 0.949350 | 0.878780 | 0.955925 | 0.851436 | 0.972786 | 1.000000 | 0.857843 | 0.985041 | -0.726172 |
| BasketScored | 0.974867 | 0.996634 | 0.969970 | 0.999312 | 0.942234 | 0.857843 | 1.000000 | 0.931079 | -0.648672 |
| BasketGiven | 0.987781 | 0.943801 | 0.990020 | 0.925745 | 0.992479 | 0.985041 | 0.931079 | 1.000000 | -0.727323 |
| HighestPositionHeld | -0.708002 | -0.669215 | -0.711065 | -0.647819 | -0.720248 | -0.726172 | -0.648672 | -0.727323 | 1.000000 |
plt.figure(figsize=(10,5))
sns.heatmap(BBdata1.select_dtypes(include='number').corr(), annot=True, linewidths=.5, fmt= '.2f', center = 1 ) # heatmap of the numeric columns
plt.show()
!pip install pandas-profiling
import pandas_profiling
pandas_profiling.ProfileReport(BBdata1)
BBdata1.select_dtypes(include='number').skew() # skewness of the numeric columns only
Tournament             1.197176
Score                  1.574104
PlayedGames            1.123454
WonGames               1.786067
DrawnGames             0.984899
LostGames              0.880596
BasketScored           1.758058
BasketGiven            0.958164
HighestPositionHeld    0.832164
dtype: float64
plt.figure(figsize=(20,20)) # setting the figure size
ax = sns.barplot(x='WonGames', y='Team', data=BBdata1, palette='muted') # barplot
# The dataset marks missing values with '-', so treat '-' as NaN
missing_values=['-']
COMdata= pd.read_csv('C:\\Users\\Prem\\Favorites\\DS - Part3 - CompanyX_EU.csv',na_values=missing_values)
COMdata.head(20)
| | Startup | Product | Funding | Event | Result | OperatingState |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired |
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating |
| 7 | 8020 Media | 8020media.com | NaN | TC40 2007 | Contestant | Operating |
| 8 | About Last Night | aboutlastnight.net | NaN | Disrupt NYC 2012 | Contestant | Operating |
| 9 | Adgregate Markets | adgregate.com | NaN | TC50 2008 | Contestant | Operating |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating |
| 11 | AdRocket | adrocket.com | $1M | TC50 2008 | Contestant | Closed |
| 12 | Affective Interfaces | affectiveinterfaces.com | NaN | TC50 2009 | Contestant | Operating |
| 13 | Agrilyst | agrilyst.com | $1M | Disrupt SF 2015 | Winner | Operating |
| 14 | Aiden | aiden.ai | $750K | Disrupt London 2016 | Contestant | Operating |
| 15 | AirBoxLab | foobot.io | $17.9K | Hardware Battlefield 2014 | Contestant | Operating |
| 16 | Aircall | aircall.io | $11.6M | Disrupt SF 2015 | Contestant | Operating |
| 17 | AirDroids | airdroids.com | $929.2K | Hardware Battlefield 2014 | Contestant | Closed |
| 18 | AirHelp | airhelp.com | $12.2M | Disrupt NYC 2014 | Contestant | Operating |
| 19 | AirWander | airwander.com | NaN | Disrupt London 2016 | Contestant | Operating |
COMdata.shape
(662, 6)
COMdata.dtypes
Startup           object
Product           object
Funding           object
Event             object
Result            object
OperatingState    object
dtype: object
COMdata.isnull().sum()
Startup             0
Product             6
Funding           214
Event              13
Result              0
OperatingState      0
dtype: int64
COMdata1 = COMdata.dropna().copy(deep=True)
COMdata1.shape
(440, 6)
COMdata1.isnull().sum().sum()
0
COMdata1.loc[:,'Funds_in_million'] = COMdata1['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' else (float(x[1:-1])*1000 if x[-1] == 'B' else float(x[1:-1])))
COMdata1
| | Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 29.00 |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | 583.60 |
| 657 | Zivity | zivity.com | $8M | TC40 2007 | Contestant | Operating | 8.00 |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 223.00 |
| 660 | Zula | zulaapp.com | $3.4M | Disrupt SF 2013 | Audience choice | Operating | 3.40 |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 31.50 |
440 rows × 7 columns
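The one-line lambda used to build `Funds_in_million` can be hard to read. A more explicit version of the same conversion might look like the sketch below. The helper name `funding_to_millions` is hypothetical, and the sketch assumes every value has the form `$<number><K|M|B>`, as in the rows shown above.

```python
def funding_to_millions(text):
    """Convert a string such as '$630K', '$1.8M' or '$1.7B' to millions of dollars."""
    # Multipliers that express each suffix in millions (assumed suffixes: K, M, B)
    multipliers = {"K": 1 / 1000, "M": 1.0, "B": 1000.0}
    value = text.strip()
    suffix = value[-1].upper()
    if suffix not in multipliers:
        raise ValueError(f"unrecognised funding suffix in {text!r}")
    # Drop the leading '$' and the trailing suffix, then scale to millions
    number = float(value.lstrip("$")[:-1])
    return number * multipliers[suffix]


for example in ["$630K", "$19.3M", "$1.7B"]:
    print(example, "->", funding_to_millions(example))
```

With a helper like this, the column could be built with `COMdata1['Funding'].apply(funding_to_millions)`. Unlike the lambda, it raises an error on an unexpected suffix instead of silently treating the value as millions.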
fundplot = plt.boxplot(COMdata1['Funds_in_million']);
plt.title('Boxplot for Funding')
plt.ylabel('Funds_in_million')
plt.show()
plt.gcf().set_size_inches(15,7)
sns.boxplot(x=COMdata1['Funds_in_million'],y=COMdata1['OperatingState']);
upper_fence = fundplot['caps'][1].get_data()[1][1] # read the upper whisker cap from the box plot and use it as the upper fence
upper_fence
22.0
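The value read from the box plot is the upper whisker cap, which with matplotlib's default settings is the largest observation that does not exceed Q3 + 1.5 × IQR, not the fence itself. The sketch below shows the underlying calculation on a small illustrative array. The numbers are made up for the example and are not the full funding column.

```python
import numpy as np

# Illustrative values in millions (hypothetical sample, not the real dataset)
values = np.array([0.5, 1.0, 1.8, 3.4, 8.0, 12.2, 19.3, 31.5, 223.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
fence = q3 + 1.5 * iqr                       # theoretical upper fence

whisker_cap = values[values <= fence].max()  # what the box plot cap shows
outliers = values[values > fence]            # points drawn beyond the whisker

print("fence:", fence)
print("whisker cap:", whisker_cap)
print("outliers:", outliers)
```

Filtering on `> whisker_cap` and on `> fence` selects the same rows, since no observation lies between the cap and the fence.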
print(f'Number of outliers = {len(COMdata1[COMdata1.Funds_in_million > upper_fence])}')
COMdata1[COMdata1.Funds_in_million > upper_fence]
Number of outliers = 60
| | Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million |
|---|---|---|---|---|---|---|---|
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 29.0 |
| 31 | Anyclip | anyclip.com | $24M | TC50 2009 | Finalist | Operating | 24.0 |
| 40 | Artsy | artsy.net | $50.9M | Disrupt NYC 2010 | Contestant | Operating | 50.9 |
| 49 | Badgeville | badgeville.com | $40M | Disrupt SF 2010 | Finalist | Acquired | 40.0 |
| 56 | Betterment | betterment.com | $205M | Disrupt NYC 2010 | Finalist | Operating | 205.0 |
| 108 | Clickable | clickable.com | $32.5M | TC40 2007 | Finalist | Acquired | 32.5 |
| 113 | Cloudflare | cloudflare.com | $182.1M | Disrupt SF 2010 | Runner up | Operating | 182.1 |
| 128 | Credit Sesame | creditsesame.com | $35.4M | Disrupt SF 2010 | Contestant | Operating | 35.4 |
| 130 | CrowdFlower Inc. | crowdflower.com | $38M | TC50 2009 | Contestant | Operating | 38.0 |
| 132 | Cubic Telecom | cubictelecom.com | $37.1M | TC40 2007 | Contestant | Operating | 37.1 |
| 138 | DataSift | datasift.com | $72M | Disrupt SF 2010 | Finalist | Operating | 72.0 |
| 139 | DataXu | dataxu.com | $64M | TC50 2009 | Contestant | Operating | 64.0 |
| 154 | Dropbox | dropbox.com | $1.7B | TC50 2008 | Contestant | Operating | 1700.0 |
| 166 | Enigma | enigma.io | $34.6M | Disrupt NYC 2013 | Winner | Operating | 34.6 |
| 172 | EverythingMe | everything.me | $35.5M | Disrupt NYC 2011 | Contestant | Closed | 35.5 |
| 179 | Famo.us | famo.us | $30.1M | Disrupt SF 2012 | Contestant | Operating | 30.1 |
| 180 | Farmigo | farmigo.com | $26M | Disrupt SF 2011 | Finalist | Operating | 26.0 |
| 188 | Fitbit | fitbit.com | $66M | TC50 2008 | Finalist | Ipo | 66.0 |
| 191 | Flock | flock.com | $28.3M | TC40 2007 | Contestant | Acquired | 28.3 |
| 209 | Getaround | getaround.com | $103M | Disrupt NYC 2011 | Winner | Operating | 103.0 |
| 213 | Gild | gild.com | $25.9M | Disrupt SF 2010 | Contestant | Acquired | 25.9 |
| 215 | Glide | glide.me | $36.5M | Disrupt NYC 2013 | Audience choice | Operating | 36.5 |
| 225 | Grockit | grockit.com | $44.7M | TC50 2008 | Finalist | Acquired | 44.7 |
| 231 | HackerRank | hackerrank.com | $24.2M | Disrupt SF 2012 | Contestant | Operating | 24.2 |
| 271 | Ionic Security | ionicsecurity.com | $122.4M | Disrupt SF 2012 | Contestant | Operating | 122.4 |
| 276 | IZEA | izea.com | $34.9M | Disrupt NYC 2010 | Contestant | Ipo | 34.9 |
| 279 | Jiff | jiff.com | $67.8M | Disrupt SF 2011 | Contestant | Acquired | 67.8 |
| 282 | Kaltura | kaltura.com | $166.1M | TC40 2007 | Audience choice | Operating | 166.1 |
| 302 | Layer | layer.com | $42.1M | Disrupt SF 2013 | Winner | Operating | 42.1 |
| 305 | LearnVest | learnvest.com | $69M | TC50 2009 | Contestant | Acquired | 69.0 |
| 313 | LiveIntent | liveintent.com | $65.1M | Disrupt NYC 2010 | Audience choice | Operating | 65.1 |
| 325 | Lystable | lystable.com | $25.1M | Disrupt London 2015 | Finalist | Operating | 25.1 |
| 346 | Mint | mint.com | $31.8M | TC40 2007 | Winner | Acquired | 31.8 |
| 389 | OrderWithMe | orderwithme.com | $37M | Disrupt Beijing 2011 | Winner | Operating | 37.0 |
| 393 | Osmo | playosmo.com | $38.5M | Disrupt SF 2013 | Contestant | Operating | 38.5 |
| 394 | Ossia | ossia.com | $25.5M | Disrupt SF 2013 | Runner up | Operating | 25.5 |
| 398 | Owlet Baby Care | owletcare.com | $24M | Hardware Battlefield 2014 | Runner up | Operating | 24.0 |
| 427 | Postmates | postmates.com | $278M | Disrupt SF 2011 | Contestant | Operating | 278.0 |
| 432 | Prism Skylabs | prism.com | $24M | Disrupt SF 2011 | Runner up | Operating | 24.0 |
| 435 | PubMatic | pubmatic.com | $63M | TC40 2007 | Contestant | Operating | 63.0 |
| 460 | Roadie | roadie.com | $25M | Disrupt NYC 2014 | Audience choice | Operating | 25.0 |
| 471 | SeatGeek | seatgeek.com | $160M | TC50 2009 | Contestant | Operating | 160.0 |
| 546 | StyleSeat | styleseat.com | $40M | Disrupt NYC 2011 | Contestant | Operating | 40.0 |
| 555 | Symphony Commerce | symphonycommerce.com | $47.4M | Disrupt NYC 2011 | Contestant | Operating | 47.4 |
| 560 | Talkdesk | talkdesk.com | $24.5M | Disrupt NYC 2012 | Contestant | Operating | 24.5 |
| 581 | TouchPal | touchpal.com | $25M | Disrupt Beijing 2011 | Finalist | Operating | 25.0 |
| 593 | TrueCar | truecar.com | $332.4M | TC50 2008 | Contestant | Ipo | 332.4 |
| 598 | UberConference | uberconference.com | $35M | Disrupt NYC 2012 | Winner | Operating | 35.0 |
| 606 | Upwork | upwork.com | $168.8M | TC50 2009 | Audience choice | Operating | 168.8 |
| 615 | VideoSurf | videosurf.com | $28M | TC50 2008 | Contestant | Acquired | 28.0 |
| 625 | Voxy | voxy.com | $30.8M | Disrupt SF 2010 | Contestant | Operating | 30.8 |
| 643 | Xobni | yahoo.com | $41.8M | TC40 2007 | Contestant | Acquired | 41.8 |
| 644 | Yammer | yammer.com | $142M | TC50 2008 | Winner | Acquired | 142.0 |
| 647 | Yext | yext.com | $117.8M | TC50 2009 | Contestant | Ipo | 117.8 |
| 649 | YouNow | younow.com | $26M | Disrupt SF 2011 | Contestant | Operating | 26.0 |
| 650 | YourMechanic | yourmechanic.com | $28M | Disrupt SF 2012 | Winner | Operating | 28.0 |
| 654 | ZEFR | zefr.com | $62.1M | Disrupt NYC 2010 | Contestant | Operating | 62.1 |
| 656 | Zenefits | zenefits.com | $583.6M | Disrupt NYC 2013 | Finalist | Operating | 583.6 |
| 659 | Zocdoc | zocdoc.com | $223M | TC40 2007 | Contestant | Operating | 223.0 |
| 661 | Zumper | zumper.com | $31.5M | Disrupt SF 2012 | Finalist | Operating | 31.5 |
COMdata1.drop(COMdata1[COMdata1.Funds_in_million > upper_fence].index,inplace=True)
COMdata1.shape
(380, 7)
plt.gcf().set_size_inches(15,7)
sns.boxplot(x='Funds_in_million', y='OperatingState', data=COMdata1); # keyword arguments avoid the seaborn FutureWarning
COMdata1.OperatingState.value_counts()
Operating    269
Closed        56
Acquired      55
Name: OperatingState, dtype: int64
sns.histplot(COMdata1.Funds_in_million, kde=True, stat='density') # histplot replaces the deprecated distplot
plt.title('Distribution of funds raised across all companies')
plt.show()
sns.set_theme(style="darkgrid")
ax1 = sns.barplot(x='OperatingState', y='Funds_in_million', data=COMdata1, palette='bright') # barplot
fig, ax = plt.subplots(1,2)
fig.set_figheight(5)
fig.set_figwidth(15)
# histplot replaces the deprecated distplot
sns.histplot(COMdata1.loc[COMdata1.OperatingState == 'Operating','Funds_in_million'], kde=True, stat='density', ax=ax[0])
sns.histplot(COMdata1.loc[COMdata1.OperatingState == 'Closed','Funds_in_million'], kde=True, stat='density', ax=ax[1])
ax[0].set_title('Funds raised by companies that are still operating')
ax[1].set_title('Funds raised by companies that have closed')
plt.show()
Null hypothesis (Ho): The mean funds raised by companies that are still operating equals the mean funds raised by companies that have closed
Alternate hypothesis (Ha): The mean funds raised by the two groups are different
from statsmodels.stats.weightstats import ztest
sample1 = COMdata1.loc[COMdata1.OperatingState == 'Operating', 'Funds_in_million']
sample2 = COMdata1.loc[COMdata1.OperatingState =='Closed', 'Funds_in_million']
alpha = 0.05 # Let's consider a significance level of 5%
test_statistic, p_value = ztest(sample1, sample2)
if p_value <= alpha:
    print(f'Since the p-value, {round(p_value, 3)} <= {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
    print(f'''\t Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and,
\t we fail to reject the Null hypothesis''')
Since the p-value, 0.172 > 0.05 (alpha) the difference is not significant and,
we fail to reject the Null hypothesis
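To make the test less of a black box, here is a minimal sketch of a two-sample z-test with a pooled variance estimate, written with the standard library only. The helper name `pooled_two_sample_ztest` and the two short samples are illustrative assumptions; the sketch is not claimed to reproduce the library call above digit for digit.

```python
import math
from statistics import mean

def pooled_two_sample_ztest(x1, x2):
    """Two-sided z-test for equal means using a pooled variance estimate."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = mean(x1), mean(x2)
    # Sum of squared deviations within each sample
    ss1 = sum((v - m1) ** 2 for v in x1)
    ss2 = sum((v - m2) ** 2 for v in x2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (m1 - m2) / std_err
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up samples, for illustration only
z_same, p_same = pooled_two_sample_ztest([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
z_shift, p_shift = pooled_two_sample_ztest([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(z_same, p_same, z_shift, p_shift)
```

A z-test leans on the normal approximation for the sample means, so with skewed funding data like the distributions plotted above, the result is best read as approximate.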
COMdata2 = COMdata1.copy()
COMdata2.head()
| | Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| 11 | AdRocket | adrocket.com | $1M | TC50 2008 | Contestant | Closed | 1.00 |
COMdata2.Result.value_counts()
Contestant         281
Finalist            51
Audience choice     18
Winner              16
Runner up           14
Name: Result, dtype: int64
# "Winners" here means every result other than plain 'Contestant' (Finalist, Audience choice, Winner, Runner up)
winners = (COMdata2.Result != 'Contestant').sum() # does not depend on the ordering of value_counts()
contestants = (COMdata2.Result == 'Contestant').sum()
contestants_operating = COMdata2.OperatingState[COMdata2.Result == 'Contestant'].value_counts().loc['Operating']
winners_operating = COMdata2.OperatingState[COMdata2.Result != 'Contestant'].value_counts().loc['Operating']
A=winners_operating/winners
B=contestants_operating/contestants
print('Proportion of winners that are still operating',round(A,3))
print('Proportion of contestants that are still operating',round(B,3))
Proportion of winners that are still operating 0.768
Proportion of contestants that are still operating 0.687
Null hypothesis (Ho): The proportion of companies that are still operating is the same for winners and contestants
Alternative hypothesis (Ha): The proportion of companies that are still operating differs between winners and contestants
from statsmodels.stats.proportion import proportions_ztest
test_statistic, p_value = proportions_ztest([contestants_operating, winners_operating], [contestants, winners])
if p_value <= alpha:
    print(f'Since the p-value, {round(p_value, 3)} <= {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
    print(f'''\t Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and,
\t we fail to reject the Null hypothesis''')
Since the p-value, 0.128 > 0.05 (alpha) the difference is not significant and,
we fail to reject the Null hypothesis
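The arithmetic behind a pooled two-proportion z-test can be sketched by hand. The counts used below (193 of 281 contestants and 76 of 99 winners still operating) are inferred from the proportions and value counts printed above rather than read from the data, and the function name `two_proportion_ztest` is a hypothetical helper.

```python
import math

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided z-test for equality of two proportions (pooled standard error)."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)   # common proportion under Ho
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))     # 2 * (1 - Phi(|z|))
    return z, p_value

# Counts inferred from the printed proportions (assumption, see lead-in)
z, p_value = two_proportion_ztest(193, 281, 76, 99)
print(round(z, 3), round(p_value, 3))
```

With these inferred counts the p-value rounds to 0.128, in line with the result printed above.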
COMdata2.Event.value_counts()
TC50 2008                    25
TC40 2007                    22
Disrupt NY 2015              21
Disrupt NYC 2013             19
Disrupt SF 2015              19
TC50 2009                    19
Disrupt NYC 2012             19
Disrupt SF 2011              19
Disrupt SF 2013              19
Disrupt SF 2014              19
Disrupt SF 2016              17
Disrupt NY 2016              16
Disrupt SF 2012              15
Disrupt NYC 2014             15
Disrupt NYC 2011             15
Disrupt SF 2010              13
Hardware Battlefield 2014    12
Hardware Battlefield 2016    12
Disrupt London 2015          11
Hardware Battlefield 2015    10
Disrupt NYC 2010             10
Disrupt EU 2014              10
Disrupt London 2016          10
Disrupt EU 2013               9
Disrupt Beijing 2011          4
Name: Event, dtype: int64
COMdata2[COMdata2.Event.apply(lambda x: 'Disrupt' in x)].Event.value_counts()
Disrupt NY 2015         21
Disrupt SF 2011         19
Disrupt NYC 2013        19
Disrupt NYC 2012        19
Disrupt SF 2014         19
Disrupt SF 2015         19
Disrupt SF 2013         19
Disrupt SF 2016         17
Disrupt NY 2016         16
Disrupt SF 2012         15
Disrupt NYC 2014        15
Disrupt NYC 2011        15
Disrupt SF 2010         13
Disrupt London 2015     11
Disrupt London 2016     10
Disrupt NYC 2010        10
Disrupt EU 2014         10
Disrupt EU 2013          9
Disrupt Beijing 2011     4
Name: Event, dtype: int64
events = COMdata2[COMdata2.Event.apply(lambda x: 'Disrupt' in x and int(x[-4:]) > 2012)].Event
events
1 Disrupt NYC 2013
3 Disrupt NY 2016
13 Disrupt SF 2015
14 Disrupt London 2016
16 Disrupt SF 2015
...
635 Disrupt NY 2015
641 Disrupt NYC 2013
642 Disrupt SF 2014
646 Disrupt London 2015
660 Disrupt SF 2013
Name: Event, Length: 185, dtype: object
NYE = COMdata2.loc[events[events.apply(lambda x: 'NY' in x)].index,'Funds_in_million']
SFE = COMdata2.loc[events[events.apply(lambda x: 'SF' in x)].index,'Funds_in_million']
EUE = COMdata2.loc[events[events.apply(lambda x: 'EU' in x or 'London' in x)].index,'Funds_in_million']
print(len(NYE), len(SFE), len(EUE))
71 74 40
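The substring filters above can be collected into one labelling function, which makes the grouping rule explicit: Disrupt events from 2013 onwards, with London counted under EU. This is only a sketch; `disrupt_region` and its `min_year` parameter are hypothetical names, and the event strings are examples taken from the counts shown earlier.

```python
import re

def disrupt_region(event, min_year=2013):
    """Map an event name to 'NY', 'SF' or 'EU'; return None if it does not qualify."""
    match = re.search(r'(\d{4})$', event)          # trailing four-digit year
    if 'Disrupt' not in event or match is None or int(match.group(1)) < min_year:
        return None
    if 'NY' in event:                              # covers both 'NY' and 'NYC'
        return 'NY'
    if 'SF' in event:
        return 'SF'
    if 'EU' in event or 'London' in event:         # London is grouped with EU
        return 'EU'
    return None

examples = ['Disrupt NYC 2013', 'Disrupt NY 2016', 'Disrupt SF 2015',
            'Disrupt London 2016', 'Disrupt EU 2014', 'Disrupt SF 2012',
            'TC50 2008', 'Hardware Battlefield 2014', 'Disrupt Beijing 2011']
labels = [disrupt_region(e) for e in examples]
print(labels)
```

With a DataFrame, the same function could be passed to `COMdata2.Event.apply(...)` to build a region column and then group by it.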
Null Hypothesis (Ho): The average funds raised by companies are the same across the three locations (NY, SF and EU)
Alternative Hypothesis (Ha): The average funds raised by companies differ for at least one of the three locations
plt.figure(figsize=(15,6))
# histplot replaces the deprecated distplot
sns.histplot(NYE, color = 'red', label = 'NY', kde=True, stat='density')
sns.histplot(SFE, color = 'gold', label = 'SF', kde=True, stat='density')
sns.histplot(EUE, color = 'blue', label = 'EU', kde=True, stat='density')
plt.legend()
plt.show()
from scipy.stats import f_oneway
stat, p_value = f_oneway(NYE, SFE, EUE)
if p_value <= alpha:
    print(f'Since the p-value, {round(p_value, 3)} <= {alpha} (alpha) the difference is significant and we reject the Null hypothesis')
else:
    print(f'''\t Since the p-value, {round(p_value,3)} > {alpha} (alpha) the difference is not significant and,
\t we fail to reject the Null hypothesis''')
Since the p-value, 0.628 > 0.05 (alpha) the difference is not significant and,
we fail to reject the Null hypothesis
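For reference, the F statistic of a one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below computes it by hand and compares it with `scipy.stats.f_oneway`. The function name `one_way_anova` and the three short lists are made-up illustrations, not the NYE, SFE and EUE samples.

```python
import numpy as np
from scipy import stats

def one_way_anova(*groups):
    """Return (F statistic, p-value) for a one-way ANOVA computed from first principles."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), all_values.size
    # Variation of group means around the grand mean, weighted by group size
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    p_value = stats.f.sf(f_stat, k - 1, n - k)     # upper tail of the F distribution
    return f_stat, p_value

# Hypothetical funding samples in millions (illustrative only)
ny = [1.0, 2.5, 0.6, 4.0, 3.1]
sf = [2.0, 1.2, 5.5, 0.9, 3.3]
eu = [0.8, 1.9, 2.2, 1.1]
f_manual, p_manual = one_way_anova(ny, sf, eu)
f_scipy, p_scipy = stats.f_oneway(ny, sf, eu)
print(f_manual, p_manual, f_scipy, p_scipy)
```

One-way ANOVA assumes roughly normal groups with similar variances; for strongly skewed funding data, a rank-based alternative such as `scipy.stats.kruskal` may be worth running as a cross-check.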